Form similarity via Levenshtein distance between ortho-filtered logarithmic ruling-gap ratios
Abstract
Geometric invariants are combined with edit distance to compare the ruling configuration of noisy filled-out forms. It is shown that gap-ratios used as features capture most of the ruling information of even low-resolution and poorly scanned form images, and that the edit distance is tolerant of missed and spurious rulings. No preprocessing is required and the potentially time-consuming string operations are performed on a sparse representation of the detected rulings. Based on edit distance, 158 Arabic forms are classified into 15 groups with 89% accuracy. Since the method was developed for an application that precludes public dissemination of the data, it is illustrated on public-domain death certificates.
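The idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the quantization step `step`, and the use of ratios of successive gaps are assumptions made for illustration only, and the "ortho-filtering" named in the title is omitted because the abstract does not describe it. Ruling positions along one axis are turned into logarithms of successive gap ratios (which do not change when the page is shifted or uniformly scaled), quantized into integer symbols, and compared with a standard Levenshtein distance.

```python
import math


def log_gap_ratios(positions, step=0.25):
    """Quantized log-ratios of successive gaps between ruling positions.

    `positions` are ruling coordinates along one axis (e.g. pixel rows).
    `step` is a hypothetical quantization bin width, not taken from the paper.
    """
    pos = sorted(positions)
    # Gaps between neighbouring rulings; coincident rulings are skipped.
    gaps = [b - a for a, b in zip(pos, pos[1:]) if b > a]
    # Ratios of successive gaps are invariant to translation and scale.
    return [round(math.log(g2 / g1) / step) for g1, g2 in zip(gaps, gaps[1:])]


def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # substitution
        prev = cur
    return prev[-1]


def form_distance(rulings_a, rulings_b, step=0.25):
    """Edit distance between the quantized gap-ratio strings of two forms."""
    return levenshtein(log_gap_ratios(rulings_a, step),
                       log_gap_ratios(rulings_b, step))
```

In this sketch, two scans of the same layout at different scales and offsets yield identical symbol strings, while a single spurious ruling at the edge of the page changes the string by one symbol, so the distance stays small. Classification into groups could then assign each form to the group of its nearest reference form, though the abstract does not say which classifier was used.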
Similar resources
Cross-language Phonetic Similarity Measure on Terms Appeared in Asian Languages
This study aims to develop a phonetic similarity measurement method across Asian languages. The method, a cross-language similarity algorithm, aggregates the transcription of language-specific Romanization, the International Phonetic Alphabet, the Soundex algorithm, and Levenshtein distance. To evaluate the proposed algorithm, this study involves an experiment using ninety-two chemical element nam...
Lexical similarity can distinguish between automatic and manual translations
We consider the problem of identifying automatic translations from manual translations of the same sentence. Using two different similarity metrics (BLEU and Levenshtein edit distance), we found out that automatic translations are closer to each other than they are to manual translations. We also use phylogenetic trees to provide a visual representation of the distances between pairs of individ...
Mutual intelligibility of Chinese dialects: Predicting cross-dialect word intelligibility from lexical and phonological similarity
This paper aims to predict mutual intelligibility (defined here as cross-dialectal word recognition) between 15 Chinese dialects from lexical and phonological distance measures. Distances were measured on the stimulus materials used in the experiment. Their predictive power was compared with earlier similar distance measures based on large word lists. Predictors based on just the stimulus mater...
Adaptive String Distance Measures for Bilingual Dialect Lexicon Induction
This paper compares different measures of graphemic similarity applied to the task of bilingual lexicon induction between a Swiss German dialect and Standard German. The measures have been adapted to this particular language pair by training stochastic transducers with the Expectation-Maximisation algorithm or by using handmade transduction rules. These adaptive metrics show up to 11% F-measure ...
A Knowledge-Rich Approach to Measuring the Similarity between Bulgarian and Russian Words
We propose a novel knowledge-rich approach to measuring the similarity between a pair of words. The algorithm is tailored to Bulgarian and Russian and takes into account the orthographic and the phonetic correspondences between the two Slavic languages: it combines lemmatization, hand-crafted transformation rules, and weighted Levenshtein distance. The experimental results show an 11-pt interpo...
Publication date: 2014